By: Victoria Engler
Description: I recently decided to pursue data science full time. Coming from a background in data analytics and Python programming, I remembered trying this project in one of my first DS courses.
I couldn't remember whether we ever completed it end to end, so I used it here to re-teach myself some DS fundamentals and strengthen my ever-growing skill set.
For those who are unaware, the Titanic dataset challenge is a popular beginner's project: the goal is to build a classifier that predicts whether an individual survived the wreck. Working through various models, I tried to judge whether each one was overfitted or underfitted, and whether it was even the right model to use. This project taught me that there's no set answer in DS, and that there are many different methods and resources out there to learn from that will help me ask tougher questions about the data and be as accurate as possible. In the end I'm still left with questions, and I plan to keep tuning the models below to decide on the best path forward in any challenging scenario.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from mlxtend.classifier import StackingCVClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
import warnings
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn import model_selection
from sklearn import metrics
warnings.filterwarnings("ignore")
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
target = pd.read_csv('gender_submission.csv')
# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat is the replacement
entire_df = pd.concat([train, test])
entire_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
X_test = test[['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
               'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y_test = target['Survived']
X_train = train[['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
                 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y_train = train['Survived']
X_train.isna().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
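Raw counts are hard to compare across columns; the *share* of missing values per column makes it obvious that Cabin is mostly empty. A small sketch on a toy frame (the column names mirror the data, the values are purely illustrative):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for X_train; the NaN placement is made up
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 53.10],
})

# Percentage of missing values per column, largest first
missing_pct = toy.isna().mean().sort_values(ascending=False) * 100
print(missing_pct)
```

On the real training frame this would show Cabin at roughly 77% missing, which is a strong hint to drop or heavily engineer that column rather than impute it.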
It's always interesting to observe how the world was back then and compare it to now. Below, I wanted to see how fares differed by both class and sex.
classBySex=entire_df.groupby(['Sex', 'Pclass'])['Fare'].mean().unstack()
fig = px.bar(classBySex)
fig.show()
classBySex
| Pclass | 1 | 2 | 3 |
|---|---|---|---|
| Sex | |||
| female | 109.412385 | 23.234827 | 15.324250 |
| male | 69.888385 | 19.904946 | 12.415462 |
It's also interesting that females paid a higher average fare; it makes me wonder whether that was driven by the smaller number of women and children aboard.
I was also curious about the number of kids and how much their fares varied by sex. The number of kids in each class was pretty consistent between the sexes.
f=entire_df[(entire_df['Age']<18)]
df=f.groupby(['Sex'])['Pclass'].value_counts().unstack()
print(df)
px.bar(df)
Pclass   1   2   3
Sex
female   8  18  46
male     7  15  60
The fare for boys was slightly higher in all classes.
kidsFareBySex=entire_df[entire_df['Age']<18].groupby(['Sex', 'Pclass'])['Fare'].mean().unstack()
fig = px.bar(kidsFareBySex)
fig.show()
There were significantly more men than women in all classes.
f=entire_df[(entire_df['Age']>18)]
df=f.groupby(['Sex'])['Pclass'].value_counts().unstack()
#print(df)
px.bar(df)
In first class, women's fares were surprisingly higher than men's.
adultsFareBySex=entire_df[entire_df['Age']>18].groupby(['Sex', 'Pclass'])['Fare'].mean().unstack()
fig = px.bar(adultsFareBySex)
fig.show()
Creating the preprocessor and pipeline for the initial analysis
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
categorical_features = ["Embarked", "Sex", "Pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features),
]
)
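The `handle_unknown="ignore"` setting is worth a small digression: the test set can contain categories that never appeared during training, and this setting encodes them as all zeros instead of raising an error. A quick illustration (the port codes mirror the Embarked column):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["S"], ["C"], ["Q"]])  # learns the categories C, Q, S

# An unseen category encodes as an all-zeros row instead of raising an error
print(enc.transform([["X"]]).toarray())
```

Without this flag, a single novel category at prediction time would crash the whole pipeline.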
# Append the classifier to the preprocessing steps.
# Now we have a full prediction pipeline.
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
model score: 0.955
y_pred=clf.predict(X_train)
accuracy_score(y_true=y_train, y_pred=y_pred)
0.7912457912457912
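Accuracy alone hides *which* kind of mistakes the model makes; a confusion matrix separates false positives from false negatives. A minimal sketch with made-up labels (in the notebook, these would be `y_train` and `clf.predict(X_train)`):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)  # rows: actual class, cols: predicted class
print(cm)
print(accuracy_score(y_true, y_hat))
```

Here the model makes one false positive and one false negative; on the Titanic data, checking this split tells you whether the model systematically misses survivors or non-survivors.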
Defining four different models, just to take a look at how they do and see whether I can adjust any of the hyperparameters.
classifier1 = Ridge()  # note: Ridge is a regressor, so its score() reports R^2, not accuracy
classifier2 = LogisticRegression()
classifier3 = DecisionTreeClassifier()
# Initializing the Random Forest classifier
classifier4 = RandomForestClassifier(n_estimators=250,
                                     criterion='gini',
                                     max_depth=10,
                                     max_features=0.3,
                                     min_samples_split=3,
                                     random_state=1)
classifiers = {'Ridge': classifier1,
"LGC": classifier2,
"DT": classifier3,
"RF": classifier4}
classification_scores={}
for name, model in classifiers.items():
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", model)]
    )
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
    classification_scores[name] = (f'Score: {score}', f'CV Score: {cv_scores}')
classification_scores
{'Ridge': ('Score: 0.6824745596095751',
'CV Score: [0.31460216 0.38270739 0.38369443 0.32383759 0.43457676]'),
'LGC': ('Score: 0.9545454545454546',
'CV Score: [0.78212291 0.81460674 0.78089888 0.76966292 0.80337079]'),
'DT': ('Score: 0.8133971291866029',
'CV Score: [0.73184358 0.76966292 0.80337079 0.76404494 0.80337079]'),
'RF': ('Score: 0.8755980861244019',
'CV Score: [0.82681564 0.79775281 0.85955056 0.8258427 0.84831461]')}
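Eyeballing the raw arrays is error-prone; summarizing each model's CV scores by mean and spread makes the comparison concrete. A quick sketch using the values copied from the output above:

```python
import numpy as np

# Cross-validation score arrays from the run above
cv = {
    "Ridge": [0.31460216, 0.38270739, 0.38369443, 0.32383759, 0.43457676],
    "LGC":   [0.78212291, 0.81460674, 0.78089888, 0.76966292, 0.80337079],
    "DT":    [0.73184358, 0.76966292, 0.80337079, 0.76404494, 0.80337079],
    "RF":    [0.82681564, 0.79775281, 0.85955056, 0.8258427,  0.84831461],
}

for name, scores in cv.items():
    print(f"{name}: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```

By mean CV accuracy the ranking is RF > LGC > DT > Ridge, which is worth keeping in mind alongside the single test-set scores.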
At first glance, LogisticRegression and RandomForest posted the strongest scores, with the DecisionTreeClassifier not far behind. Now to dig deeper into LogisticRegression and the DecisionTree below.
DTparams = {"classifier__criterion": ["gini", "entropy"],
            "classifier__max_depth": [3, 7, 10],
            # as a float, min_samples_split must be in (0.0, 1.0]; the logspace values
            # above 1.0 will fail to fit (hidden here because warnings are suppressed)
            "classifier__min_samples_split": np.logspace(-3, 3, 20),
            'preprocessor__cat__handle_unknown': ['ignore']}
LRparams = {"classifier__verbose": [0, 1, 4],
            # verbose and n_jobs control logging/parallelism, not the fitted model,
            # so searching over them can't change the predictions
            "classifier__n_jobs": [-1, 3, 5, None],
            "classifier__fit_intercept": [True, False],
            'preprocessor__cat__handle_unknown': ['ignore']}
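A quick aside on the key names: GridSearchCV addresses parameters inside a Pipeline as `<step_name>__<param>`, chaining double underscores for nested steps, which is why every key above is prefixed with `classifier__` or `preprocessor__cat__`. A minimal check on a throwaway pipeline:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Grid-search keys must match entries in the pipeline's get_params() dictionary
pipe = Pipeline([("scaler", StandardScaler()), ("classifier", LogisticRegression())])
print("classifier__fit_intercept" in pipe.get_params())  # True
```

If a key doesn't appear in `get_params()`, GridSearchCV raises an "Invalid parameter" error, so this is a cheap way to sanity-check a grid before a long search.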
LRclf = Pipeline(
[("preprocessor", preprocessor),
('classifier', LogisticRegression())]
)
LRgrid = GridSearchCV(estimator = LRclf,
param_grid = LRparams,
cv = 5,
verbose=True,
n_jobs=-1,
scoring='precision')
DTclf = Pipeline(
    [("preprocessor", preprocessor),
     # max_features='auto' was removed in recent scikit-learn; 'sqrt' is the equivalent
     ("classifier", DecisionTreeClassifier(random_state=7, max_features='sqrt'))
    ])
DTgrid = GridSearchCV(estimator = DTclf,
param_grid = DTparams,
cv = 5,
verbose=True,
n_jobs=-1,
scoring='precision')
LRgrid.fit(X_train,y_train)
print(LRgrid.score(X_test,y_test))
print(LRgrid.best_params_)
best=LRgrid.best_score_
y_pred=LRgrid.predict_proba(X_test)
auc = metrics.roc_auc_score(y_test, y_pred[:,1])
# print(f"The AUC of the Logistic Regression classifier is {auc:.3f}")
print(f"The best score of the Logistic Regression Grid Search is {best}")
print('-----------------------------------------')
DTgrid.fit(X_train,y_train)
print(DTgrid.score(X_test,y_test))
print(DTgrid.best_params_)
y_pred=DTgrid.predict_proba(X_test)
best=DTgrid.best_score_
auc = metrics.roc_auc_score(y_test, y_pred[:,1])
# Print results
print(f"The best score of the Decision Tree Classifier is {best}")
Fitting 5 folds for each of 24 candidates, totalling 120 fits
0.9182389937106918
{'classifier__fit_intercept': True, 'classifier__n_jobs': -1, 'classifier__verbose': 0, 'preprocessor__cat__handle_unknown': 'ignore'}
The best score of the Logistic Regression Grid Search is 0.7441297423524584
-----------------------------------------
Fitting 5 folds for each of 120 candidates, totalling 600 fits
1.0
{'classifier__criterion': 'entropy', 'classifier__max_depth': 3, 'classifier__min_samples_split': 0.001, 'preprocessor__cat__handle_unknown': 'ignore'}
The best score of the Decision Tree Classifier is 0.9518355770602241
Seeing that the DTC got a 1.0 is a clear sign to me that it's overfitted. However, the grid search's best (precision) score of .95 leaves me a bit more confident that my ability to properly predict survivors and non-survivors might be okay.
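The AUC was computed above but its print statement is commented out. To show what the metric captures on its own, here's a tiny self-contained sketch (the labels and probabilities are made up, not Titanic predictions):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
p_hat  = [0.1, 0.4, 0.35, 0.8]

# AUC = probability that a random positive is ranked above a random negative
print(roc_auc_score(y_true, p_hat))
```

Here one of the four positive/negative pairs is ranked incorrectly (0.4 vs 0.35), giving an AUC of 0.75; a value of 1.0 would mean every survivor is ranked above every non-survivor.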
DTclf = Pipeline(
    [("preprocessor", preprocessor),
     ("classifier", DecisionTreeClassifier(criterion='entropy',
                                           random_state=7,
                                           max_features='sqrt',  # 'auto' was removed in recent scikit-learn
                                           max_depth=3,
                                           min_samples_split=0.001))
    ])
cv_scores = cross_val_score(DTclf, X_train, y_train, cv=50)
print(f'CV Score: {cv_scores}')
CV Score: [0.83333333 0.66666667 0.66666667 0.88888889 0.66666667 0.77777778 0.72222222 0.77777778 0.66666667 0.77777778 0.72222222 0.66666667 0.88888889 0.77777778 0.72222222 0.77777778 0.94444444 0.83333333 0.83333333 0.72222222 0.77777778 0.83333333 0.77777778 0.83333333 0.88888889 0.72222222 0.77777778 0.77777778 0.83333333 0.94444444 0.72222222 0.77777778 0.83333333 0.83333333 0.83333333 0.77777778 0.66666667 0.72222222 0.72222222 0.83333333 0.83333333 0.82352941 0.88235294 0.76470588 0.76470588 0.76470588 0.76470588 0.82352941 0.82352941 0.88235294]
fig = px.scatter(cv_scores)
fig.show()
LRclf = Pipeline(
[("preprocessor", preprocessor),
("classifier", LogisticRegression(fit_intercept=True, n_jobs= -1, verbose= 0) )
])
cv_scores = cross_val_score(LRclf, X_train, y_train, cv=50)
print(f'CV Score: {cv_scores}')
CV Score: [0.94444444 0.55555556 0.83333333 0.83333333 0.72222222 0.83333333 0.55555556 0.94444444 0.66666667 0.88888889 0.83333333 0.72222222 0.88888889 0.72222222 0.66666667 0.77777778 0.94444444 0.72222222 0.88888889 0.88888889 0.77777778 0.88888889 0.66666667 0.72222222 0.83333333 0.77777778 0.77777778 0.83333333 0.66666667 0.94444444 0.61111111 0.66666667 0.77777778 0.83333333 0.83333333 0.72222222 0.72222222 0.77777778 0.83333333 0.77777778 0.83333333 0.76470588 0.76470588 0.94117647 0.64705882 0.70588235 0.82352941 0.88235294 0.82352941 0.82352941]
There's a bit more variability with the Logistic Regression classifier. I'll be looking more deeply at these results to truly interpret the best route; for now, I lean toward Logistic Regression given its strong accuracy score.
fig = px.scatter(cv_scores)
fig.show()
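To quantify that variability rather than eyeball the scatter plots, each model's CV array can be reduced to a mean and standard deviation. A sketch with illustrative numbers (in the notebook, pass each model's actual `cv_scores` array):

```python
import numpy as np

def summarize(scores):
    """Return (mean, std) of a list of cross-validation scores."""
    scores = np.asarray(scores)
    return scores.mean(), scores.std()

# Illustrative score arrays: same mean, different spread
dt_scores = [0.83, 0.78, 0.72, 0.78, 0.83]
lr_scores = [0.94, 0.56, 0.83, 0.89, 0.72]

for name, s in [("DT", dt_scores), ("LR", lr_scores)]:
    m, sd = summarize(s)
    print(f"{name}: {m:.3f} +/- {sd:.3f}")
```

Two models with identical mean scores can differ a lot in spread, and the higher-variance one is the riskier bet on unseen data, which is exactly the trade-off at play here.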
Next steps: